Abstract —Pharmacovigilance is the field of science devoted to
the collection, analysis, and prevention of Adverse Drug Reactions
(ADRs). Efficient strategies for the extraction of information
about ADRs from free text sources are essential to support the
important task of detecting and classifying unexpected pathologies,
possibly related to (therapy-related) drug use. Narrative ADR 
descriptions may be collected in different ways, e.g., either by 
monitoring social networks or through so-called “spontaneous
reporting”, the main method pharmacovigilance adopts in order
to identify ADRs. The encoding of free-text ADR descriptions
according to MedDRA standard terminology is central for report
analysis. It is a complex task, which has to be carried out
manually by pharmacovigilance experts. Manual encoding is
time-consuming. Moreover, the accuracy of the encoding may
suffer, since the number of reports is growing day by day.
In this paper, we propose
MagiCoder, an efficient Natural Language Processing algorithm 
able to automatically derive MedDRA terminologies from free- 
text ADR descriptions. MagiCoder is part of VigiWork, a 
web application for online ADR reporting and analysis. From 
a practical point of view, MagiCoder reduces the encoding time 
of ADR reports. Pharmacologists simply have to review and
validate the MedDRA terms proposed by MagiCoder, instead
of choosing the right terms among the 70K terms of MedDRA.
Such an improvement in the efficiency of pharmacologists’ work
also has a relevant impact on the quality of the subsequent data
analysis.

Our proposal is based on a general approach that does not depend
on the language considered. Indeed, we developed MagiCoder
for the Italian pharmacovigilance language, but preliminary
analyses show that it is robust to language and dictionary
changes.

Index Terms —pharmacovigilance; natural language processing;
adverse reaction entry.

I. Introduction 

Pharmacovigilance includes all activities aimed at systematically
studying the risks and benefits related to the correct use of
marketed drugs. The development of a new drug, which begins
with the production and ends with the commercialization of
a drug, considers both pre-clinical studies (usually tests on
animals) and clinical studies (tests on patients). After these
phases, a pharmaceutical company can request authorization
for the commercialization of the new drug. However, whereas
at this stage drug benefits are well known, results about
drug safety are not conclusive [1]. The pre-marketing tasks
cited above have some limitations: they involve a small number
of patients; they exclude relevant subgroups of the population,
such as children and the elderly; the experimentation period is
relatively short, less than two years; the experimentation does
not deal with possibly concomitant pathologies, or with the
concurrent use of other drugs. For all these reasons, uncommon
Adverse Drug Reactions (ADRs), such as slowly developing
pathologies (e.g., carcinogenesis) or pathologies related to
specific groups of patients, cannot be discovered before the
commercialization. It may happen that drugs are withdrawn from
the market after the detection of unexpected side effects. Thus,
it stands to reason that the control of ADRs is a necessity,
considering the mass production of drugs. As a consequence,
pharmacovigilance plays a crucial role in human healthcare
improvement [1].

Spontaneous reporting is the main method pharmacovigilance
adopts in order to identify adverse drug reactions. Through
spontaneous reporting, health care professionals, patients, and
pharmaceutical companies can voluntarily send information
about suspected ADRs to the national regulatory authority*.
Spontaneous reporting is an important activity: it provides
pharmacologists and regulatory authorities with early alerts,
by considering every drug on the market and every patient
category.

The Italian system of pharmacovigilance requires that each
local health structure has a qualified person responsible
for pharmacovigilance. Her/his assignment is to collect reports
of suspected ADRs and to send them to the National Network
of Pharmacovigilance (RNF) within seven days^. Once reports
have been notified and sent to RNF, currently through a web
application, they are analysed by both local pharmacovigilance
centres and the Italian Drug Agency (AIFA). Subsequently,
they are sent to Eudravigilance Q and to VigiBase ||^
(respectively, the European and the worldwide pharmacovigilance
networks of which RNF is part). In general, spontaneous ADR
reports are filled in by health care professionals (medical
specialists, general practitioners, nurses, and so on), but also
by citizens. In recent years, the number of Italian ADR reports
has grown sharply, going from approximately ten thousand in 2006
to around sixty thousand in 2014, as shown in Figure 1.

*In Italy, the Italian Drug Agency, AIFA (Agenzia Italiana del FArmaco),
http://www.agenziafarmaco.gov.it/

^According to the Italian Law, Art. 132 of Legislative Decree Number 219
of 04/24/2006.

Fig. 1. The yearly increasing number of reports about suspected adverse reactions induced by drugs in Italy.

Since the post-marketing surveillance of drugs is of paramount
importance, such an increase is certainly positive. At the same
time, the manual review of reports has become difficult and often
unmanageable both for the people responsible for pharmacovigilance
and for regional centres. Indeed, each report must be checked,
in order to control its quality; it is then encoded and
transferred to RNF via “copy by hand” (actually, a printed copy).

Recently, to increase the efficiency in collecting and managing
ADR reports, a web application called VigiWork has been
designed and implemented for Italian pharmacovigilance
(at https://vigiwork.vigifarmaco.it/). Through VigiWork,
a spontaneous report can be submitted online both by healthcare
professionals and by citizens (through different forms), as
anonymous or registered users. VigiWork is user-friendly:
the user is guided in compiling the report, which is
filled in step by step (each phase corresponds to a different
report section, i.e., “Patient”, “Adverse Drug Reaction”, “Drug
Treatments” and “Reporter”, respectively). Inserted data are
then validated, since a report can be successfully sent only
after completing the correct sequence of steps.

VigiWork is also useful for pharmacovigilance supervisors. 
Indeed, VigiWork reports are high-quality documents, since 
they are automatically validated (the presence, the format, 
and the consistency of data are validated at the filling time). 
As a consequence, they are easier to review (especially with 
respect to printed reports). Moreover, thanks to VigiWork, a
pharmacologist can send a report to RNF by simply pressing a
button, after reviewing it.

Online reports have grown to about 30% of the total number of
Italian reports. As expected, the average time between the
dispatch of online reports and their insertion into RNF is
noticeably shorter than for printed reports. Nevertheless,
there is an operation which still requires the manual work of
the people responsible for pharmacovigilance, also when revising
online reports: the encoding in MedDRA terminology of the free
text through which the reporter describes one or more adverse
drug reactions. The description of a suspected ADR through
narrative text could seem redundant or useless. Indeed, one could
reasonably imagine sound solutions based either on an
autocompletion form or on a menu with MedDRA terms. In these
solutions, the description of ADRs would be directly encoded by
the reporter and no expert work for MedDRA terminology extraction
would be required. However, such solutions are not completely
suited to the pharmacovigilance domain, and the narrative
description of ADRs remains a desirable feature, for at least two
reasons. First, the description of an ADR by means of one of the
seventy thousand MedDRA terms is a complex task. In most cases,
the reporter who points out the adverse reaction is not an expert
in MedDRA terminology. This holds in particular for citizens, but
it is also true for several professionals. Thus, describing ADRs
by means of natural language sentences is simpler. Second,
choosing the suitable term(s) from a given list or from an
autocompletion field can influence the reporter and limit her/his
expressiveness. As a consequence, the quality of the description
would be undermined in this case too. Therefore, VigiWork offers
a free-text form for specifying an ADR with all the possible
details, without any restriction on the content or limit on the
length of the written text. Consequently, MedDRA encoding has to
be carried out manually by the qualified people responsible for
pharmacovigilance, before the transmission to RNF. As this work
is expensive in terms of time and attention required, the
accuracy of the encoding may suffer, given the continuous growth
of the number of reports.


According to the described scenario, in this paper we 
propose MagiCoder, a natural language processing (NLP) Q 
algorithm, which automatically assigns one or more MedDRA 
term codes to each narrative ADR description in the online 
reports collected by VigiWork. 


The paper is organized as follows. In Section II we provide
some background notions and we discuss related work. In
Section III we present the algorithm MagiCoder, by providing
both a qualitative description and the pseudocode. In Section IV
we briefly describe the user interface, we explain the
benchmark we developed to test MagiCoder's performance and
we discuss first results. Finally, in Section V we discuss the
main features of our work and sketch some future research
lines.







II. Background and Related Work 


A. Natural Language Processing and Text Mining in Medicine 

Automatic detection of adverse drug reactions from text
has recently received increasing interest in pharmacovigilance
research. Narrative descriptions of ADRs come from heterogeneous
sources: spontaneous reporting, Electronic Health Records,
Clinical Reports, and social media. In Q-Q some NLP
approaches have been proposed for the extraction of ADRs from
text. In [10], the authors collect narrative discharge
summaries from the Clinical Information System at New York 
Presbyterian Hospital. MedLEE, an NLP system, is applied to 
this collection, to identify medication events and entities, which 
could be potential adverse drug events. Co-occurrence statistics 
with adjusted volume tests were used to detect associations 
between the two types of entities, to calculate the strengths 
of the associations, and to determine their cutoff thresholds. 
In [11], the authors report on the adaptation of a machine
learning-based system for the identification and extraction of
ADRs in case reports. The role of NLP approaches in optimised
machine learning algorithms is also explored in p^ , where the
authors address the problem of automatic detection of ADR
assertive text segments from distinct sources, focusing on data
posted by users on social media (Twitter and DailyStrength, a
health care oriented social media site). Existing methodologies
for NLP are discussed, and an experimental comparison between
NLP-based machine learning algorithms over data sets from
different sources is proposed. Moreover, the authors address the
issue of data imbalance for the ADR description task. In fO) the
authors propose to use association mining and Proportional
Reporting Ratios (PRR, a well-known pharmacovigilance statistical
index) to mine the associations between drugs and adverse
reactions from the user-contributed content in social media.
In order to extract adverse reactions from online text (from
health care communities), the authors apply the Consumer
Health Vocabulary (at http://www.consumerhealthvocab.org)
to generate an ADR lexicon. The ADR lexicon is a computerized
collection of health expressions derived from actual consumer
utterances (authored by consumers), linked to professional
concepts and reviewed and validated by professionals and
consumers. Narrative text is preprocessed following standard
NLP techniques (such as stop word removal, see Section III-A).
An experiment using ten drugs and five adverse drug reactions
is proposed. The Food and Drug Administration alerts are used
as the gold standard, to test the performance of the proposed
techniques. The authors developed algorithms to identify ADRs
from threads of drugs, and implemented association mining
to calculate leverage and lift for each possible pair of drugs
and adverse reactions in the dataset. At the same time, PRR is
also calculated.
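
For readers unfamiliar with the three measures named above, the following sketch computes leverage, lift and PRR from a 2x2 contingency table, using their standard textbook definitions. It is only an illustration: the function name, the argument names and the counts in the usage line are assumptions made here and are not taken from the cited study.

```python
from dataclasses import dataclass


@dataclass
class Association:
    leverage: float
    lift: float
    prr: float


def association_measures(a: int, b: int, c: int, d: int) -> Association:
    """Standard measures on a 2x2 table of (hypothetical) thread counts.

    a: drug mentioned, reaction mentioned
    b: drug mentioned, reaction absent
    c: other drugs, reaction mentioned
    d: other drugs, reaction absent
    Assumes every marginal count is non-zero (no zero-division guard).
    """
    n = a + b + c + d
    p_drug = (a + b) / n
    p_reaction = (a + c) / n
    p_both = a / n
    leverage = p_both - p_drug * p_reaction    # 0 under independence
    lift = p_both / (p_drug * p_reaction)      # 1 under independence
    prr = (a / (a + b)) / (c / (c + d))        # reporting rate vs. other drugs
    return Association(leverage, lift, prr)


# Made-up counts, for illustration only.
result = association_measures(a=20, b=80, c=10, d=890)
```

With these invented counts the reaction is reported in 20% of the threads about the drug and in about 1.1% of the threads about other drugs, which is what the PRR summarises as a single ratio.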

Other interesting papers about pharmacovigilance and machine
learning or data mining are, e.g., fM) and [TS) . In [16],
a text extraction tool is implemented on the .NET platform
with functionalities for preprocessing text (removal of stop
words, Porter stemming and use of synonyms) and matching
medical terms using permutations of words and spelling
variations (Soundex, Levenshtein distance and Longest common
subsequence distance 1(13). Its performance has been evaluated
on both manually extracted medical terms from summaries of
product characteristics and unstructured adverse effect texts
from Martindale (i.e., a medical reference for information about
drugs and medicines) using the WHO-ART and MedDRA
medical term dictionaries. Many linguistic features have
been considered and a careful analysis of performance has
been provided.

TABLE I
MedDRA Hierarchy - an Example

MedDRA Level    MedDRA Term
SOC             Skin disorders
HLGT            Epidermal conditions
HLT             Dermatitis and Eczema
PT              Asteatotic Eczema
LLT             Itch

B. MedDRA Dictionary 

The Medical Dictionary for Regulatory Activities (MedDRA)
is a medical terminology used to classify adverse event
information associated with the use of biopharmaceuticals and
other medical products (e.g., medical devices and vaccines).
Coding these data to a standard set of MedDRA terms allows health
authorities and the biopharmaceutical industry to exchange and
analyze data related to the safe use of medical products fTS) .
It has been developed by the International Conference on
Harmonization (ICH); it belongs to the International Federation
of Pharmaceutical Manufacturers and Associations (IFPMA);
it is controlled and periodically revised by the MedDRA
Maintenance and Support Services Organization (MSSO). MedDRA
is available in eleven European languages and in Chinese
and Japanese too. It is updated twice a year (in March
and in September), following a collaboration-based approach:
everyone can propose reasonable updates or changes
(e.g., as an effect of events such as the onset of new
pathologies) and a team of experts eventually decides about the
publication of the updates. MedDRA terms are organised into a
hierarchy: the SOC (System Organ Class) level includes the most
general terms; the LLT (Lowest Level Term) level includes
the most specific terms; between SOC and LLT there are
three intermediate levels (HLGT, HLT and PT).

Table I shows an example of the hierarchy: the reaction Itch
is described starting from Skin disorders, Epidermal conditions,
Dermatitis and Eczema, and Asteatotic Eczema. Preferred Terms
are Lowest Level Terms chosen to be the representative of a group
of terms. It should be stressed that the hierarchy is multiaxial:
for example, a PT (Preferred Term) can be grouped in one
or more HLTs (High Level Terms), but it belongs to only one
primary SOC (System Organ Class) term.

The encoding of ADRs through MedDRA is extremely
important for report analysis as well as for a prompt detection
of problems related to drug-based treatments. Thanks to MedDRA
it is possible to group similar/analogous cases described in
different ways (e.g., by synonyms) or with different details/levels
of abstraction.

III. MagiCoder: an Algorithm for ADR Automatic 
Encoding 

A natural language ADR description is completely free
text. The user has no limitations and can potentially write
anything: a number of online ADR descriptions actually
contain information not directly related to drug effects.
NLP software has to face and solve many issues: trivial
orthographical errors; the use of singular versus plural nouns;
the so-called “false positives”, i.e., syntactically retrieved
inappropriate results which closely resemble correct
solutions; the structure of the sentence, i.e., the way an
assertion is built up in a given language. The “intelligent”
detection of linguistic connectives is also a crucial issue. For
example, the presence of a negation can potentially change the
overall meaning of a description.

In general, a satisfactory automatization of human reasoning
and work is a subtle task, and the uncontrolled extension of
the dictionary with auxiliary synonyms or the naive ad hoc
management of particular cases can limit the efficiency
of the algorithm. For these reasons, we carefully designed
MagiCoder, also through a side-by-side collaboration between
pharmacologists and computer scientists, in order to yield
an efficient tool, capable of really supporting pharmacovigilance
activities.

In the literature, several NLP algorithms already exist, and several
interesting approaches (such as the so-called morpho-analysis
of natural language) have been studied and proposed Q, | |T9) , 
m- According to the described pharmacovigilance domain, 
we considered algorithms for the morpho-analysis and the 
part-of-speech extraction techniques 0^G9| too powerful and 
general purpose for the first solution to our problem. 

Thus, we decided to design and develop an ad hoc algorithm
for the problem we are facing, namely that of deriving MedDRA
terms from narrative text and mapping segments of text to
actual LLT terms. This task has to be done in a very short
time (we want each user/MagiCoder interaction to require
less than a second) and the solution offered to the expert has
to be readable and useful. Therefore, we decided to ignore the
structure of the narrative description and address the issue in a
simpler way. The main features of MagiCoder can be summarized
as follows:

• it requires a single linear scan of the narrative description: 
as a consequence, our solution is particularly efficient in 
terms of computational complexity; 

• it has been designed and developed for the specific
problem of mapping Italian text to the MedDRA dictionary,
but we claim that the way MagiCoder has been developed is
robust with respect to language and dictionary changes;

• the current version of MagiCoder is only based on the
pure syntactical recognition of the text and it does not
exploit any heuristic or external synonym dictionary; as it
will be discussed in Section IV, experimental results are
encouraging and we empirically observed that the use of
an external dictionary produces a relevant improvement
in performance.

A. MagiCoder: Overview 

The main idea of MagiCoder is that a single linear scan of 
the free-text is sufficient, in order to recognize MedDRA terms. 

From an abstract point of view, we try to recognize, in
the narrative description, single words belonging to LLT terms,
which do not necessarily occupy consecutive positions in the
description. This way, we try to reconstruct MedDRA terms,
taking into account the fact that in a description the reporter can
permute or omit words. As we will show, MagiCoder does not
have to deal with computationally expensive tasks, such as
subroutines for permutations and combinations of
words (as, for example, in [16]).

We can distinguish five phases in the procedure, which we will
discuss in detail in the following:

1) Preprocessing of the original text; 

2) Definition of ad-hoc data structures; 

3) Word-by-word linear scan of the description and “voting 
task”; 

4) Weight calculation and sorting of voted terms; 

5) Winning terms release. 

1) Preprocessing of the original ADR description: Given
a natural language ADR description, the text has to be
preprocessed in order to perform an efficient computation. We
adopt well-known techniques such as tokenization [21], where a
phrase is reduced to tokens, i.e., syntactical units, which often,
as in our case, correspond to words. A tokenized text can be
easily manipulated as an enumerable object, e.g., an array. A
stop word is a word which can be considered irrelevant for
the text analysis (e.g., an article or an interjection). In this
first release of our software we decided not to take into account
connectives, e.g., conjunctions, disjunctions, negations. Once
the set of stop words has been defined, the original text is
cleaned of such irrelevant words.

A fruitful preliminary step is the extraction of the
corresponding stemmed version from the original tokenized (and
stop-word free) text. Stemming is a linguistic technique that,
given a word, reduces it to a particular kind of root form.
It is useful in text analysis, in order to avoid problems such
as bad word recognition due to singular/plural forms (e.g.,
hand/hands). Stemming is also potentially harmful, since it can
generate so-called “false positive” terms. A meaningful
example can be found in the Italian language. The plural of the
word mano (in English, hand) is mani (in English, hands), and
their stem is man, which is also the stemmed version
of mania (in English, mania).
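
The preprocessing phase can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in MagiCoder: the stop-word list is a tiny hypothetical sample, and `toy_stem` is a deliberately crude stand-in for a real stemmer, written only to reproduce the mano/mani/mania collision described above.

```python
import re

# Tiny, hypothetical stop-word list; a real one would be much larger.
STOP_WORDS = {"il", "la", "le", "alla", "alle", "e", "di", "con"}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop the tokens that appear in the stop-word list."""
    return [tok for tok in tokens if tok not in STOP_WORDS]


def toy_stem(word: str) -> str:
    """Crude stand-in for a stemmer: strip trailing vowels.

    This is NOT a real Italian stemmer; it is an assumption made
    for the example and only mimics the behaviour discussed above.
    """
    stem = word.rstrip("aeiou")
    return stem if stem else word


def preprocess(text: str) -> tuple[list[str], list[str]]:
    """Return the cleaned tokens and their stemmed counterparts."""
    tokens = remove_stop_words(tokenize(text))
    return tokens, [toy_stem(tok) for tok in tokens]


tokens, stems = preprocess("Dolore alle mani e alla testa")
```

Under this toy stemmer, `mano`, `mani` and `mania` all collapse to the same stem, which is exactly the kind of false positive that the later ranking criteria have to penalise.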

2) Definition of ad hoc data structures: The algorithm
proceeds with a word-by-word comparison. We iterate on the
preprocessed text and we test if a single word w (a token)
occurs in one or many LLT terms.

In order to efficiently test if a token belongs to one or
more LLT terms, we need to create a further level of the
MedDRA dictionary. The LLT level of MedDRA is actually a
set of phrases, i.e., sequences of words. By scanning these
sequences, we build a meta-dictionary of all the words which
compose LLT terms. As we will describe in Section III-B,
in O(mk) time units (where m and k are the cardinality of
the set of LLT terms and the length of the longest LLT term
in MedDRA, respectively) we can build a hash table having
all the words occurring in MedDRA as keys, where the value
associated with key w_i contains information about the set of
LLTs containing w_i. This way, we can verify the presence in
MedDRA of a word w encountered in the ADR description
in constant time. We call this meta-dictionary DictByWord.
We also build a meta-dictionary from a stemmed version of
MedDRA, to verify the presence of stemmed descriptions. We
call it DictByWordStem.
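
A possible shape for the two meta-dictionaries is sketched below. The term codes and term texts are invented placeholders (not real MedDRA entries), `create_meta_dict` is a hypothetical name, and the stemming function passed for the second dictionary is the same toy vowel-stripping stand-in, used here only as an assumption.

```python
from collections import defaultdict

# Hypothetical LLT fragment: term code -> term text (not real MedDRA codes).
LLT_TERMS = {
    "T1": "dolore mano",
    "T2": "dolore testa",
    "T3": "mania",
}


def create_meta_dict(llt_terms, normalise=lambda w: w):
    """Map each word to the (term code, position) pairs where it occurs.

    One pass over m terms of at most k words each: O(mk) insertions.
    """
    by_word = defaultdict(list)
    for code, text in llt_terms.items():
        for position, word in enumerate(text.split()):
            by_word[normalise(word)].append((code, position))
    return dict(by_word)


dict_by_word = create_meta_dict(LLT_TERMS)
# Same construction over stems; the lambda is a toy stand-in for a stemmer.
dict_by_word_stem = create_meta_dict(LLT_TERMS, lambda w: w.rstrip("aeiou") or w)
```

Looking up a token is then a single hash access: `dict_by_word.get("dolore")` yields every term containing the word together with its position inside the term, which is the information the voting phase needs.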

The MedDRA dictionary is also loaded into hash tables for the
computation and, in general, all our main data structures
are dictionaries. We stress that, to retain efficiency, we
preferred exact string matching to approximate
string matching when looking up a word in the
meta-dictionary. Approximate string matching would allow us to
retrieve terms that would be lost in exact string matching (e.g.,
we could recognize misspelled words in the ADR description),
but it would worsen the performance of the text recognition
tool, since direct access to the dictionary would not be possible.
We discuss the problem of addressing orthographical errors in
Section V.

3) Word-by-word linear scan of the description and voting
task: Algorithm MagiCoder scans the text word by word
(remember that each word corresponds to a token) once and
performs a “voting task”: at the i-th step, it marks (i.e., “votes”)
with index i each LLT term t containing the current (i-th) word
of the ADR description. Moreover, it keeps track of the position
where the i-th word occurs in LLT terms. MagiCoder tries
to find a word match both for the exact and the stemmed
version of the meta-dictionary and keeps track of the kind of
match it has found. It updates a flag, initially set to
0, if at least one stemmed match is found. If a word w has
been exactly recognized in a term t, the match between the
stemmed versions of w and t is not considered. At the end of
the scan, the procedure has built a sub-dictionary containing
only the terms “voted” by at least one word. We will call VotedLLT
the sub-dictionary of voted terms.

Each selected term t is equipped with two auxiliary data
structures, containing, respectively:

1) the positions of the voting words in the ADR description;
we will call voters_t this sequence of indexes;

2) the positions of the voted words in the MedDRA term t;
we will call voted_t this sequence of indexes.

Moreover, we endow each selected term with a third structure
that will contain the sorting criteria we define below; we will
call it weights_t.

Let us now introduce some notation we will use in the
following. We denote as t.size the function that, given an LLT
term t, returns the number of words contained in t. We denote as
voters_t.length (resp. voted_t.length) the function that returns
the number of indexes belonging to voters_t (resp. voted_t).
We denote as voters_t.min and voters_t.max the functions that
return the minimum and the maximum indexes in voters_t,
respectively.
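
The voting scan can be illustrated as follows. This is a sketch under assumptions rather than the pseudocode of the paper: the dictionary fragment is invented, `toy_stem` is a crude placeholder for a real stemmer, and the stemming flag is kept per term, as in the description above.

```python
from collections import defaultdict


def toy_stem(word):
    """Toy stand-in for a stemmer (strips trailing vowels); an assumption."""
    return word.rstrip("aeiou") or word


def build_index(llt_terms, normalise):
    """Word -> list of (term code, position of the word in the term)."""
    index = defaultdict(list)
    for code, text in llt_terms.items():
        for pos, word in enumerate(text.split()):
            index[normalise(word)].append((code, pos))
    return index


def vote(adr_words, llt_terms):
    """Single left-to-right scan; returns voters, voted and stem_usage per term."""
    exact = build_index(llt_terms, lambda w: w)
    stemmed = build_index(llt_terms, toy_stem)
    voters = defaultdict(list)     # term -> positions in the ADR description
    voted = defaultdict(list)      # term -> positions inside the LLT term
    stem_usage = defaultdict(int)  # term -> 1 if any match needed stemming

    for i, word in enumerate(adr_words):
        exact_hits = exact.get(word, [])
        for code, pos in exact_hits:
            voters[code].append(i)
            voted[code].append(pos)
        exactly_matched = {code for code, _ in exact_hits}
        for code, pos in stemmed.get(toy_stem(word), []):
            if code in exactly_matched:
                continue  # exact match already recorded for this word and term
            voters[code].append(i)
            voted[code].append(pos)
            stem_usage[code] = 1
    return dict(voters), dict(voted), dict(stem_usage)


# Hypothetical dictionary fragment and an already preprocessed description.
LLT_TERMS = {"T1": "dolore mano", "T2": "dolore testa", "T3": "mania"}
voters, voted, stem_usage = vote(["dolore", "mani", "testa"], LLT_TERMS)
```

In this toy run the word `mani` votes for both `dolore mano` and `mania` through stemming only, so both terms get the stemming flag, while `dolore testa` is covered by exact matches.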

4) Weight calculation and sorting: After the voting task,
selected terms have to be ordered. Notice that a purely
syntactical recognition of words in LLT terms potentially
generates a large number of voted terms. So we have to: i)
filter a subset of highly feasible solutions; ii) choose a good
final selection criterion (this will be discussed in Section III-A5).

To this end, we define five criteria as “weights” to assign to
voted terms. In the following, 1/t.size is a normalization factor
(w.r.t. the length, in terms of words, of the LLT term t). For
the first four criteria the optimum value is 0.

Criterion one: Coverage

First, we consider how many words of each voted
LLT term have been recognized.

$$C_1(t) = \frac{t.size - voters_t.length}{t.size}$$


Criterion two: Type of Coverage

The algorithm considers whether a perfect matching
has been achieved with or without stemmed words.
C_2(·) is simply a flag: C_2(t) = 1 if stemming has
been used at least once in the voting procedure, and
C_2(t) = 0 otherwise.

Criterion three: Coverage Distance

The use of stemming allows one to find a number of
(otherwise lost) matches. As a side effect, we often
obtain a quite large set of joint winner candidate
terms. In this phase, we introduce a string distance
comparison between recognized words in the original
text and retrieved LLT terms. Among the possible
string metrics, we use the so-called pair distance [22],
which is robust with respect to word permutation. So,

$$C_3(t) = pair(t, \hat{t})$$

where pair(s, r) is the pair distance function (between
strings s and r) and \hat{t} is the term “rebuilt”
from the words in the ADR description corresponding to
the indexes in voters_t.

Criterion four: Coverage Density

We want to estimate how an LLT term has been
covered.

$$(voters_t.max - voters_t.min) + 1$$

The intuitive meaning of the criterion is to quantify
the “quality” of the coverage. If an LLT term has been
covered by nearby words, it will be considered a good
candidate for the solution. This criterion has to be
carefully implemented, taking into account possible
duplicate words.

Criterion five: Coverage Distribution

After the evaluation and the sorting by the criteria
described above, good solutions are sorted in the
first positions. We add a further criterion, the only
one based on the assumptions we made about the
structure of (Italian) sentences. The following formula
simply sums the indexes of the covered words for t ∈
VotedLLT:

$$C_5(t) = \sum_{i=0}^{voted_t.length - 1} voted_t[i]$$

If C_5(t) is small, it means that words in the first
positions of term t have been covered. We introduce this
criterion to discriminate between possibly joint winning
terms. Indeed, an Italian medical description of
a pathology frequently has the following shape: name
of the pathology + “location” or adjective. Intuitively,
we privilege terms for which the recognized word(s)
are probably the one(s) describing the pathology.

After computing (and storing) the weights
related to the above criteria, for each
voted term t we have the structure

weights_t = [C_1(t), C_2(t), C_3(t), C_4(t), C_5(t)],

containing the weights corresponding to the five
criteria.

We finally proceed by ordering voted terms by
multiple value sorting (on the elements of weights_t,
t ∈ VotedLLT) and call SortedVotedLLT the sorted
dictionary.
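
The weight vector and the multi-value sort can be sketched as below. Two points are assumptions made only for this illustration: the pair distance is taken to be one minus the Dice coefficient over character bigrams (a common definition that is insensitive to word order), and the density term is one possible formula that is 0 when the voting words are adjacent. Neither should be read as the exact definition used by MagiCoder. The function names and the toy data are hypothetical as well.

```python
def bigrams(text):
    """Character bigrams of each word, as a list (duplicates are kept)."""
    return [w[i:i + 2] for w in text.lower().split() for i in range(len(w) - 1)]


def pair_distance(s, r):
    """ASSUMPTION: 1 - Dice coefficient over character bigrams."""
    a, b = bigrams(s), bigrams(r)
    if not a and not b:
        return 0.0
    remaining = list(b)
    shared = 0
    for pair in a:
        if pair in remaining:
            remaining.remove(pair)
            shared += 1
    return 1.0 - 2.0 * shared / (len(a) + len(b))


def weights(term, adr_words, voters, voted, stem_usage):
    """Return (C1, ..., C5) for one voted term; lower is better."""
    size = len(term.split())
    rebuilt = " ".join(adr_words[i] for i in voters)
    span = max(voters) - min(voters) + 1
    c1 = (size - len(voters)) / size      # coverage
    c2 = stem_usage                       # type of coverage (flag)
    c3 = pair_distance(term, rebuilt)     # coverage distance
    c4 = (span - len(voters)) / size      # ASSUMPTION: one possible density
    c5 = sum(voted)                       # coverage distribution
    return (c1, c2, c3, c4, c5)


adr = ["dolore", "mani", "testa"]
candidates = {
    "dolore mano": weights("dolore mano", adr, [0, 1], [0, 1], 1),
    "dolore testa": weights("dolore testa", adr, [0, 2], [0, 1], 0),
    "mania": weights("mania", adr, [1], [0], 1),
}
# Tuples compare lexicographically, which gives the multi-value sort.
sorted_voted = sorted(candidates, key=candidates.get)
```

Representing weights_t as a tuple lets the ordinary lexicographic comparison act as the multiple value sorting: ties on the first criterion are broken by the second, and so on.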


5) Release of winning terms: In order to provide effective
support to pharmacovigilance experts’ work, it is important
to offer, among the “good” solutions of the algorithm (well
positioned LLT terms in the sorted output), a small subset of
candidate solutions, typically from one to six terms recognized
as the best match for the ADR description. We will call
SelectedLLT such a set. This is a subtle task. As previously
said, the pure syntactical recognition of MedDRA terms in
free text generates a possibly large set of syntactically good
results. Therefore, the releasing strategy has to be carefully
designed. The main idea is to select and return a subset of
voted terms which “covers” the ADR description. We iterate
on the ordered dictionary and for each t ∈ SortedVotedLLT we
iterate on voters_t and we select t if the following conditions
hold: 1) t does not belong to SelectedLLT; 2) t is not a prefix of
another selected term t' ∈ VotedLLT; 3) for any w_i ∈ voters_t,
w_i has not been covered or w_i has not been exactly covered
(only the stemmed version has been recognized)
or t has been “voted” without stemming. We keep track of
the words of the ADR description covered by the selection.
We consider the whole sorted dictionary SortedVotedLLT, but the
selection actually ends when all the words of the description
have been covered. The user interface (UI) of VigiWork
(described in Section IV) further filters winning terms, by
releasing from zero up to six solutions.
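
A simplified reading of the release phase is sketched below. Several choices are assumptions introduced for the example: the coverage mark is three-valued (uncovered, covered through stemming, covered exactly), the stemming flag is tracked per term rather than per word, condition 3 is read as "at least one voting word is new or is upgraded from a stemmed to an exact cover", and the `max_terms` cap is placed in the selection loop for brevity. The function name and the toy data are hypothetical.

```python
def release(sorted_terms, voters, stem_usage, n_words, max_terms=6):
    """Greedy cover of the description by the best-ranked voted terms."""
    UNCOVERED, STEMMED, EXACT = 0, 1, 2
    mark = [UNCOVERED] * n_words
    selected = []
    for term in sorted_terms:
        # Stop once every word is covered or enough terms were released.
        if all(m != UNCOVERED for m in mark) or len(selected) == max_terms:
            break
        # Conditions 1 and 2: skip a term that is already selected or that
        # is a prefix of an already selected term.
        if any(other.startswith(term) for other in selected):
            continue
        quality = STEMMED if stem_usage.get(term, 0) else EXACT
        # Condition 3 (simplified): the term must cover a new word, or
        # upgrade a word so far covered only through stemming.
        if not any(mark[i] < quality for i in voters[term]):
            continue
        # Dual check: drop selected terms that are prefixes of the new one.
        selected = [s for s in selected if not term.startswith(s)]
        selected.append(term)
        for i in voters[term]:
            mark[i] = max(mark[i], quality)
    return selected


# Toy input: terms already sorted by their weights, best first.
sorted_terms = ["dolore testa", "dolore mano", "mania"]
voters = {"dolore testa": [0, 2], "dolore mano": [0, 1], "mania": [1]}
stem_usage = {"dolore mano": 1, "mania": 1}
selected = release(sorted_terms, voters, stem_usage, n_words=3)
```

In the toy run the third candidate is never examined, because the first two terms already cover every word of the description.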


^In the implementation we also add the following thresholds: we choose
only terms t such that C_3(t) < 0.5 and C_5(t) < 3. We extracted these
thresholds by means of some empirical tests. We plan to adjust
them after some further performance tests.


In MagiCoder we do not need to consider ad hoc subroutines
to address permutations and combinations of words (as it is
done, for example, in [16]). In Natural Language Processing,
permutations and combinations of words are important, since
in spoken language the order of words can change w.r.t. the
formal structure of the sentences. Moreover, some words can
be omitted, while the sentence still retains the same meaning.
These aspects come for free from our voting procedure: after
the scan, we retrieve the information that a set of words covers
a term t ∈ VotedLLT, but the order between words does not
matter.


B. MagiCoder: the Algorithm 

Figure 2 depicts the pseudocode of MagiCoder. Here
we provide a high-level description of the procedure. We
represent dictionaries either as sets of words or as sets of
functions. As usual, the formula w ∈ DictByWord means
“word w belongs to dictionary DictByWord” (similarly for
DictByWordStem, VotedLLT, SortedVotedLLT, SelectedLLT).
Procedure Preprocessing takes the narrative description,
puts it into an array of words, and performs tokenization
and stop-word removal. Procedures CreateMetaDict and
CreateMetaDictStem take the dictionary of LLT terms
and create a dictionary of the words, and of their stemmed
versions, respectively, which belong to LLT terms, retaining
the information about the set of terms containing each
word. By the functional notation DictByWord(j) (similarly,
DictByWordStem(j)), we refer to the set of LLT terms
containing the word j (or its stemmed version). Function
stem(i) returns the stemmed version of word i. Function
indx_t(j) returns the position of word j in term t. stem_usage_t
is a flag, initially set to 0, which holds 1 if at least one stemmed
match with the MedDRA term t is found. adr_clear, voters_t, and
voted_t are arrays, and add[A, l] denotes the insertion of l in
array A, where l is an element or a sequence of elements.
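The preprocessing step and the two meta-dictionaries can be sketched in Python as follows. This is a minimal illustration under explicit assumptions: the stop-word list is a tiny made-up sample, toy_stem is a placeholder that only strips a final vowel (a real system would rely on a proper Italian stemmer), and applying stop-word removal to the LLT terms themselves is our choice for the sketch, not something stated in the text.

```python
import re
from collections import defaultdict

# Illustrative stop words only; not the list used by the authors.
STOP_WORDS = {"di", "in", "e", "del", "la", "il"}


def toy_stem(word):
    """Placeholder stemmer: strips one final vowel from words longer than 3 letters."""
    return word[:-1] if len(word) > 3 and word[-1] in "aeio" else word


def preprocessing(description):
    """Lower-case, tokenize, and remove stop words."""
    tokens = re.findall(r"\w+", description.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]


def create_meta_dict(llt_terms, normalize=lambda word: word):
    """Map each (optionally normalized) word to the set of LLT terms containing it."""
    meta = defaultdict(set)
    for term in llt_terms:
        for word in preprocessing(term):
            meta[normalize(word)].add(term)
    return dict(meta)
```

With this sketch, create_meta_dict(terms) plays the role of DictByWord and create_meta_dict(terms, normalize=toy_stem) plays the role of DictByWordStem; both lookups are hash-table accesses.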

C_i (i = 1, ..., 5) are the criteria defined in Section III-A4, and
procedure sortby(v_1, ..., v_k) performs the multi-value sorting
of values v_1, ..., v_k. Procedure prefix(S, t), where S is a
set of terms and t is a term, tests whether t (considered
as a string) is a prefix of a term in S. Dually, procedure
remove_prefix(S, t) tests whether S contains one or more
prefixes of t and, if so, removes them from S. Function
mark(j) specifies whether a word j has already been covered
in the (partial) solution during the term release: mark(j) holds
1 if j has been covered (with or without stemming) and
0 otherwise. We assume that, before starting the final phase
of building the solution (i.e., the returned set of LLT terms),
mark(j) = 0 for any word j belonging to the description.
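The auxiliary procedures just described admit a very compact rendering. The following Python sketch is illustrative only: the signatures are ours, and we assume that each term carries a tuple of criteria values compared lexicographically in ascending order (the actual direction of each criterion depends on the definitions in Section III-A4).

```python
def sortby(terms, weights):
    """Multi-value sort: order terms by their tuple of criteria values.

    Tuples are compared lexicographically, so the first criterion dominates
    and later criteria only break ties (ascending order is an assumption).
    """
    return sorted(terms, key=lambda term: weights[term])


def prefix(selected, term):
    """True if `term`, seen as a string, is a prefix of another term in `selected`."""
    return any(s != term and s.startswith(term) for s in selected)


def remove_prefix(selected, term):
    """Return `selected` without the terms that are proper prefixes of `term`."""
    return {s for s in selected if s == term or not term.startswith(s)}
```

Python's built-in sort is stable and compares tuples element by element, which is what makes the one-line sortby a valid multi-value sort.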

Let us now conclude this section by sketching the analysis
of the computational complexity of MagiCoder. Let n be the input
size (the length, in terms of words, of the ADR description).
Let m be the cardinality of the medical dictionary (i.e., the
number of terms). Moreover, let m' be the number of words
occurring in the dictionary and let k be the length of the
longest term t ∈ LLT. For MedDRA, we have around 70K terms
and 20K words. Notice that k is a very small constant. We






Procedure MagiCoder(D description, LLTDict dictionary)

Input: D: the narrative description; LLTDict: a data structure containing the LLT terms of the MedDRA dictionary
Output: an ordered set of LLT terms

DictByWord = CreateMetaDict(LLTDict);
DictByWordStem = CreateStemMetaDict(LLTDict);
adr_clear = Preprocessing(D);
adr_length = adr_clear.length;
foreach i ∈ [0, adr_length - 1] do
    /* test whether the current word belongs to MedDRA */
    if adr_clear[i] ∈ DictByWord then
        /* for each term containing the word */
        foreach t ∈ DictByWord(adr_clear[i]) do
            /* keep track of the index of the voting word */
            add[voters_t, i];
            /* keep track of the index of the recognized word in t */
            add[voted_t, indx_t(adr_clear[i])];
            VotedLLT = VotedLLT ∪ t;
    /* test if the current (stemmed) word belongs to the stemmed MedDRA */
    if stem(adr_clear[i]) ∈ DictByWordStem then
        foreach t ∈ DictByWordStem(stem(adr_clear[i])) do
            /* test if the current term has not been exactly voted by the same word */
            if i ∉ voters_t then
                add[voters_t, i];
                add[voted_t, indx_t(adr_clear[i])];
                /* keep track that t has been covered by a stemmed word */
                stem_usage_t = true;
                VotedLLT = VotedLLT ∪ t;
/* for each voted term, calculate the five weights of the corresponding criteria */
foreach t ∈ VotedLLT do
    add[weights_t, C1(t), C2(t), C3(t), C4(t), C5(t)];
/* multiple value sorting of the voted terms */
SortedVotedLLT = VotedLLT.sortby(C1, C2, C3, C4, C5);
foreach t ∈ SortedVotedLLT do
    foreach index ∈ voters_t do
        /* select a term t if its i-th voting word has not been covered or if its i-th voting word
           has been perfectly recognized in t, and if t is not a prefix of another already selected term */
        if ((stem_usage_t = false OR mark(adr_clear(index)) = 0) AND t ∉ SelectedLLT AND prefix(SelectedLLT, t) = false) then
            mark(adr_clear(index)) = 1;
            /* remove from the selected term set all terms which are prefixes of t */
            SelectedLLT = remove_prefix(SelectedLLT, t);
            SelectedLLT = SelectedLLT ∪ t;
return SelectedLLT


Fig. 2. Pseudocode of MagiCoder 


assume that all update operations on auxiliary data structures
require constant time. Building the meta-dictionaries DictByWord
and DictByWordStem requires O(mk) time units. In fact, the
simplest procedure to build the hash table is to scan the LLT
dictionary and, for each term t, to verify for each word w
belonging to t whether w is a key in the hash table (this can
be done in constant time). If w is a key, then we
update the values associated with w, i.e., we add t to the set
of terms containing w. Otherwise, we add the new key w
and the associated term t to the hash table. Therefore, it can
be easily verified that the voting procedure requires in the
worst case O(nm) steps, when a word belongs to all the LLT
terms. The computation of criteria-related weights requires
O(n) time units; the complexity of multi-value sorting can be
approximated to O(n log n) time units (since the number of
criteria-related weights involved in the multi-sorting is fixed
to 5). Finally, deriving the best solutions actually requires
O(nl) steps.

The computational complexity of MagiCoder is likely to
be lower than that of the tool proposed in [16]. Indeed, in
[16] the author describes a sophisticated procedure which
considers also approximate string matching. This feature does 
not allow constant time search for text-dictionary matches 
(i.e., it is not always possible to exploit direct data access 
through optimal data structures, such as hash tables). Moreover, 
explicitly considering word permutation and combination is a 
computationally expensive task. We claim that the efficiency
of MagiCoder can be preserved even when extending it with more
advanced features, such as the recognition of words in the presence
of orthographical errors. As future work, we plan to provide
formal and experimental comparisons of the performance of
MagiCoder with that of the software discussed above.

IV. Software Implementation and Testing 

A. The User Interface 

MagiCoder has been implemented as a VigiWork plug-in:
the staff responsible for pharmacovigilance can obtain the
auto-encoding of the narrative description and then revise and
validate it. Figure 3 shows a screenshot of VigiWork for
the part supporting back-end tasks (carried out by the staff in charge of
pharmacovigilance revision activities). In the upper part of the
screen it is possible to observe the five sections composing a
report. The screenshot actually shows the result of a human-MagiCoder
interaction: by pressing the button “Autocodifica
in MedDRA” (in English, “MedDRA autoencoding”), the person
responsible for pharmacovigilance obtains a MedDRA encoding
of the natural language ADR in the field “Descrizione” (in
English, “Description”). Six solutions are proposed as the best
MedDRA term candidates: the reviewer can reject a term
(through the trash icon), change one or more terms (through an option
menu), or simply validate the automatic encoding and switch
to the next section “Farmaci” (in English, “Drugs”). We are
testing MagiCoder performance in the daily pharmacovigilance
activities. Preliminary qualitative results show that MagiCoder
drastically reduces the amount of work required for the revision
of a report, allowing the pharmacovigilance stakeholders to
provide high quality data about suspected ADRs.

B. Testing 

As a preliminary step in evaluating MagiCoder performance,
we developed a benchmark which automatically
compares MagiCoder behavior with human encoding on
ADR reports that have already been manually revised and validated.

To this end, we exploited VigiSegn, a data warehouse
and OLAP system for the Italian pharmacovigilance activities
developed for the Italian Pharmacovigilance National Center
[23]. This system is based on the open source business
intelligence suite Pentaho. VigiSegn offers a large number of
encoded ADRs. The encoding has been manually performed
and validated by experts working at pharmacovigilance centres.
Encoding results have then been sent to the national regulatory
authority AIFA.

We performed a test consisting of the following steps.

1) We launch an ETL procedure through Pentaho Data
Integrator. The procedure transfers reports from VigiSegn
to an ad-hoc database TestDB. The dataset covers all the
6780 reports received, revised, and validated during the
year 2014 for the Italian region Veneto.

2) We launch an ETL procedure which extracts the narrative
descriptions from the reports stored in TestDB. For
each description, the procedure calls MagiCoder from
VigiWork; the output, i.e., a list of MedDRA terms, is
stored in a table of TestDB.

3) Manual and automatic solutions, i.e., LLT term sets, are
finally compared through an SQL query. We compute to what
extent manual solutions are “covered” by MagiCoder. In
other words, we perform a similarity test between the
two output sets. In order to have two uniform data sets,
we map each LLT term, both from the manual and the
automatic solutions, to its corresponding preferred term.
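The comparison in step 3 can be pictured with a few lines of Python. This is only a sketch of one possible reading of “covered”: the actual test is run as an SQL query on TestDB, the helper names are hypothetical, and the LLT-to-PT mapping below is a two-entry illustrative sample rather than the MedDRA hierarchy.

```python
def to_pt(llt_terms, llt_to_pt):
    """Map a collection of LLT terms to the set of their preferred terms (PT)."""
    # Terms missing from the mapping are kept unchanged in this sketch.
    return {llt_to_pt.get(term, term) for term in llt_terms}


def coverage(manual_llt, automatic_llt, llt_to_pt):
    """Fraction of the manual PT-level solution also found by the automatic one."""
    manual_pt = to_pt(manual_llt, llt_to_pt)
    automatic_pt = to_pt(automatic_llt, llt_to_pt)
    if not manual_pt:
        return 1.0  # nothing to cover
    return len(manual_pt & automatic_pt) / len(manual_pt)


# Illustrative mapping only: both LLT terms are assumed to share one PT.
SAMPLE_LLT_TO_PT = {"Febbre": "Piressia", "Piressia": "Piressia"}
```

With this mapping, a manual solution {Piressia, Cefalea} is fully covered by an automatic solution {Febbre, Cefalea, Dolore}, because both synonyms collapse onto the same preferred term.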

Table II shows the results of this first performance test.

It is worth noting that this test simply estimates how similar
MagiCoder behavior is to the manual work on the
whole set of solutions, without considering the quality of the
manual encoding. We may observe that for short descriptions
MagiCoder results are very close to those from manual
encoding. The percentage of similarity decreases as the
number of characters grows, but it is stable beyond a
certain threshold. This may suggest that MagiCoder behaves
well even on very long (intractable) descriptions: like a human
reviewer, the procedure does not encode redundant text. Since
we did not evaluate the quality of the human solutions we take
into account, we are working on a further quantitative analysis
of MagiCoder performance. We are developing an experimental
test involving three experts in report revision. Two
experts (and a third one, in case a reconciliation of diverging
encodings is needed) are manually encoding a representative
sample of ADR descriptions (about 200), in order to build
a ground truth data set. These “certified” manual solutions
will be compared, report by report, with MagiCoder’s outputs.
The test has been designed to effectively measure the soundness
and completeness of MagiCoder. Informally, soundness can
be estimated with respect to the false positive terms provided
by MagiCoder; completeness can be estimated according to the
LLT terms omitted by MagiCoder. We will precisely quantify
the difference between the human and the automatic encoding
(taking into account also syntactically different but semantically
equivalent solutions) and, thus, we will be able to compute the
standard deviation of the behavior of the procedure w.r.t. the
expected performance.
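One way to make the two informal notions concrete is sketched below in Python. The sketch assumes that soundness is measured as the share of proposed terms that are not false positives (i.e., precision) and completeness as the share of ground-truth terms that are not omitted (i.e., recall); this reading, like the function name, is our assumption and not a definition given in the text.

```python
def soundness_completeness(gold_terms, proposed_terms):
    """Return (soundness, completeness) of a proposed encoding w.r.t. a gold one."""
    gold, proposed = set(gold_terms), set(proposed_terms)
    true_positives = len(gold & proposed)
    false_positives = len(proposed - gold)   # terms proposed but not in the gold set
    omitted = len(gold - proposed)           # gold terms missed by the proposal
    soundness = true_positives / (true_positives + false_positives) if proposed else 1.0
    completeness = true_positives / (true_positives + omitted) if gold else 1.0
    return soundness, completeness
```

For a gold encoding {Cefalea, Febbre} and a proposed encoding {Cefalea, Febbre, Dolore}, completeness is 1 (no omitted term) while soundness is 2/3 (one extra term). Semantically equivalent terms would have to be reconciled before the comparison, for example by mapping both sets to preferred terms.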

C. Examples 

Table III provides some examples of the behavior of
MagiCoder. We propose some free-text ADR descriptions from 
TestDB and we provide both the manual and the automatic 
encodings into LLT terms. We also provide the English 
translation of the natural language text (we actually provide a 
quite straightforward literal translation). 

D1- anaphylactic shock (hypotension + cutaneous rash) 1 hour
after taking the drug. 

D2- swelling in vaccination location left from 11/5; 
temperature less than 39,5 from 11/21; vesicles, blisters 
around the cheek from 11/10. 

D3- extended local reaction, local pain, headache, fever for 
two days. 


In Table III we use the following notations: matching superscript
marks identify pairs of LLT terms retrieved by both the human
and the automatic encoding, and distinguish identical terms from semantically
Fig. 3. A partial screenshot of VigiWork User Interface


Length of the description (# chars) | Percentage of global identical solutions at the PT level
Short descriptions (up to 20 chars) | 81%
Short/medium descriptions (from 20 up to 40 chars) | 62%
Medium descriptions (from 40 up to 100 chars) | 62%
Long descriptions (from 100 up to 250 chars) | 61%

TABLE II

First results of MagiCoder performances


equivalent or similar LLT terms retrieved by the human and the
automatic encoding, respectively; we use bold type to denote
terms that are recognized by MagiCoder and that have not
been encoded by the reviewer; we use italic type in D1, D2, D3
to denote text recognized only by MagiCoder. For example, in
description D3, “cefalea” (in English, “headache”) is retrieved
and codified both by the human reviewer and MagiCoder; in
description D2, ADR “febbre” (in English, “fever”) has been
codified with the term itself by the algorithm, whereas the
reviewer codified it with its synonym “piressia”; in D1, ADR
“ipotensione” (in English, “hypotension”) has been retrieved
only by MagiCoder.

V. Conclusions and Future Work

In this paper we propose MagiCoder, a simple and efficient
NLP software able to provide concrete support to the
pharmacovigilance task of revising ADR spontaneous reports.
MagiCoder takes as input a narrative description of a suspected
ADR and produces as output a list of MedDRA terms that
“cover” the medical meaning of the free-text description. We
presented and implemented the first version of the algorithm,
and preliminary results about its performance are encouraging.

Finally, let us sketch here some ongoing and future work.
First, we aim to prove that MagiCoder is robust with respect
to language (and dictionary) changes. The way the algorithm
has been developed suggests that MagiCoder can be a valid
tool also for narrative descriptions written in English. Indeed,
the algorithm retrieves a set of words, which covers an LLT
term t, from a free-text description, without considering the
order between words or the structure of the sentence. This
way, we avoid the problem of “specializing” MagiCoder for
any given language. Furthermore, MagiCoder performance
can be strengthened, still maintaining the simple “skeleton”
we proposed, possibly by embedding new features inspired by
advanced NLP techniques. Even though negative sentences
seem to be uncommon in ADR descriptions (at least in the
data set we analyzed), the detection of negative forms is
a short-term issue we aim to address. As a first step, we
plan to recognize words that may represent negations and
to signal them to the reviewer through the graphical UI. In























# | Narrative Description | LLT Human Encoding | LLT MagiCoder Encoding
D1 | Shock anafilattico (ipotensione + rash cutaneo) 1 h dopo assunzione x os del farmaco | Shock anafilattico | Ipotensione, Shock anafilattico
D2 | gonfiore in sede di vaccinazione sx dal 5/11, febbre meno di 39,5 dal 21/11, vescicole, bolle presso la guancia dal 10/11 | Gonfiore in sede di vaccinazione, Piressia, Vescicole in sede di vaccinazione | Bolle, Febbre, Gonfiore in sede di vaccinazione, Vescicole
D3 | Reazione locale estesa, dolore locale; cefalea e febbre per due giorni | Cefalea, Febbre, Reazione in sede di vaccinazione | Cefalea, Dolore, Febbre, Reazione locale

TABLE III

Examples of MagiCoder behavior


this way, the software sends to the report reviewer an alert
about the (possible) failure of the syntactical word-by-word
recognition. Moreover, we plan to address the management
of orthographical errors possibly contained in narrative ADR
descriptions. We did not take this issue into account in the
current version of MagiCoder. A solution could be to include an
ad-hoc (medical term-oriented) spell checker in VigiWork, to
point out to the user that she/he is making an error in writing
the current word in the free description field. This should
drastically reduce users’ orthographical errors without heavy
side effects on MagiCoder development and performance.
As a further extension of MagiCoder, we will enrich the
algorithm with heuristics and synonym dictionaries. Moving
towards the use of ad-hoc thesaurus dictionaries, our idea
is to progressively (through everyday learning and feedback
coming from experience) extend MedDRA with synonyms of
LLT terms. Finally, we aim to apply MagiCoder (and its
refinements) to several different sources for ADR detection,
such as, for example, drug information leaflets.